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Identifying Biochemical Reaction Networks From Heterogeneous Datasets 

Wei Pan, Ye Yuan*, Lennart Ljung, Jorge Gonfalves and Guy-Bart Stan 


Abstract — In this paper, we propose a new method to identify 
biochemical reaction networks (i.e. both reactions and kinetic 
parameters) from heterogeneous datasets. Such datasets can 
contain (a) data from several replicates of an experiment 
performed on a biological system; (b) data measured from 
a biochemical network subjected to different experimental 
conditions, for example, changes/perturbations in biological 
inductions, temperature, gene knock-out, gene over-expression, 
etc. Simultaneous integration of various datasets to perform sys¬ 
tem identification has the potential to avoid non-identifiability 
issues typically arising when only single datasets are used. 

I. Introduction 

The problem of identifying biological networks from 
experimental time series data is of fundamental interest 
in systems and synthetic biology [22]. For example, such 
information can aid in the design of drugs or of synthetic 
biology genetic controllers. Tools from system identification 
[2] can be applied for such purposes. However, most system 
identification methods produce estimates of model structures 
based on data coming from a single experiment. 

The interest in identification methods able to handle 
several datasets simultaneously is twofold. Firstly, with the 
increasing availability of “big data” obtained from sophis¬ 
ticated biological instruments, e.g. large ‘omics’ datasets, 
attention has turned to the efficient and effective integration 
of these data and to the maximum extraction of information 
from them. Such datasets can contain (a) data from repli¬ 
cates of an experiment performed on a biological system 
of interest under identical experimental conditions; (b) data 
measured from a biochemical network subjected to different 
experimental conditions, for example, different biological 
inducers, temperature, stress, optical input, gene knock-out 
and over-expression, etc. The challenges for simultaneously 
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considering heterogeneous datasets during system identifi¬ 
cation are: (a) the system itself is unknown, i.e. neither 
the structure nor the corresponding parameters are known; 
(b) it is unclear how heterogeneous datasets collected under 
different experimental conditions influence the “quality” of 
the identified system. 

Secondly, in control or synthetic biology applications the 
systems to be controlled typically need to be modelled first. 
Highly detailed or complex models are typically difficult to 
handle using rigorous control design methods. Therefore, 
one typically prefers to use simple or sparse models that 
capture at best the dynamics expressed in the collected 
data. The identification and use of simple or sparse models 
inevitably introduces model class uncertainties and parameter 
uncertainties [3], [4]. To assess these uncertainties replicates 
of multiple experiments is typically necessary. 

Our approach is based on the concept of sparse Bayesian 
learning [5], [6] and on the definition of a unified optimisa¬ 
tion problem allowing the consideration of different param¬ 
eter values for different experimental conditions, and whose 
solution is a model consistent with all datasets available for 
identification. The ability to consider various datasets si¬ 
multaneously can potentially avoid non-identifiability issues 
arising when a single dataset is used [7]. Furthermore, by 
comparing the identified parameter values associated with 
different conditions, we can pinpoint the influence specific 
experimental changes have on system parameters. 

The notation in this paper is standard and can be found in 
SI Section [ST] 

II. Model 

We consider dynamical systems described by nonlinear 
differential/difference equation with additive noise: 

— fn , 11^) ^ — J ; • ■ • : tlx 
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where 6{xnt) = int for continuous-time system; 6{xnt) = 
Xnt or Xnt — Xn,t-i or some known transformation of histor¬ 
ical data for discrete-time system; Vns G R and /„s(xt, Uj) : 

_)■ K are basis functions that govern the dynamics. 
To ensure existence and uniqueness of solutions, the func¬ 
tions fns{xt,Ut) are assumed to be Lipschitz continuous. 
Note that we do not assume a priori knowledge of the form 
of the nonlinear functions appearing on the right-hand side 
of the equations in Q, e.g. whether the degradation obeys 
first-order or enzymatic catalysed dynamics or whether the 
proteins are repressors or activators. 

When data are sampled, we assume the data matrix and 
first derivative/difference data matrix satisfying 0 can be 
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respectively. 

Based on these defined data matrices, the system in ([T]) can 
be written as y„ = + n = 1 

... ,(5(a;„M)]^ € K 

= [^ni,---AuM)]' e and the dic¬ 
tionary matrix S ^mxn„ column being 

[/nj(xi, Ui), ..., fni{y^M, um)]^- The noise or disturbance 
vector is assumed to be Gaussian distributed with zero 
mean and covariance 11 S The identification 

goal is to estimate v„ of the linear regression formulation 
y„ = n = 1,..., rix- Two issues need to be 

raised here. The first one is the selection of basis function 
/ns{’, •) which is key to the success of identification. Some 


discussion on this can be found in SI Section S2 especially 
for biochemical reaction networks. The second one is the 
estimation of the first derivative data matrix which is not 

we provide a method to estimate 


trivial. In SI Section S3 


first derivatives from noisy time-series data. 

If a total number of C datasets are collected from C 
independent experiments, we put a subscript [c] to index 
the identification problem associated with the specific dataset 
obtained from experiment [c]. In what follows we gather in 
a matrix similar to the set of all candidate/possible 
basis functions that we want to consider during the identifi¬ 
cation. The identification problem is then written as: 

yW = AWwM + n = 1,..., nx, c = 1,..., C. (3) 

The solution to to is typically going to be sparse, 
which is mainly due to the potential introduction of non- 
relevant and/or non-independent basis functions in 

Since the linear regression problems in ([^ are inde¬ 
pendent, for simplicity of notation, we omit the subscript n 
used to index the state variable and simply write: 
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where xl ' = 
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is the state vector 


at time instant t. It should be noted that N, the number 
of basis functions or number of columns of the dictionary 


'Note that the covariance matrix is not necessarily diagonal. 


matrix AI"^! € ygj.y large. Without loss of 

generality, we assume = • • • = = M. 

Remark 1: The model class considered in Q can be 
enlarged in various ways. First, measurement noise, which 
is ubiquitous in practice, can be accounted for using the 
following linear measurement equation: 

zt = xt + et, (6) 

where the measurement noise e* is assumed i.i.d. Gaussian. 
Under this formulation, the noise-contaminated data Zt rep¬ 
resents the collected data rather than Xt in Second, the 
additive stochastic term In Q is often used to model dy¬ 
namic noise or diffusion. In many practical application, how¬ 
ever, it is necessary to account for multiplicative noise in¬ 
stead of additive noise. Multiplicative noise can be accounted 
for by replacing Q with Xt = /(xt, Ut)v -I- fi(xt, ut)^t. In 
SI Section we show how the framework presented here 
can be modified to encompass these extensions. 


III. Identification from multiple datasets 


To ensure reproducibility, experimentalists repeat their ex¬ 
periments under the same conditions, and the collected data 
are then called “replicates”. Typically, only the average value 
over these replicates is used for modelling or identification 
purposes. In this case, however, only the first moment is used 
and information provided by higher order moments is lost. 
Moreover, when data originate from different experimental 
conditions, it is usually very hard to combine the datasets 
into a single identification problem. This section will ad¬ 
dress these issues by showing how several datasets can be 
combined to define a unified optimisation problem whose 
solution is an identified model consistent with the various 
datasets available for identification. 

To consider heterogeneous datasets in one single formu¬ 
lation, we stack the various equations in Q (see Eq. Q). 
Each stacked equation in Eq. 0 corresponds to a replicate 
or an experiment performed by changing the experimental 
conditions on the same system. 

In Eq. 0, Ai = blkdiag[A,^^],..., and Wi = 

..., for i = \,... ,N. Based on the stacked 

formulation given in Eq. 0 we further define 
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which gives 


y = Aw -f 


( 9 ) 


This yields a formulation very similar to that presented 
previously for a single linear regression problem. However, 
in the multi-experiment formulation 0. there is now a 
special block structure for y, A and w. 

Remark 2: When wt'l is fixed to be w for all the exper¬ 
iments, i.e. = • • • = = w, we can formulate the 
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identification problem as a single linear regression problem 
by concatenation: 
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To incorporate prior knowledge into the identification 
problem, it is often important to be able to impose constraints 
on w. In biological systems, positivity of the parameters 
constituting w is an example of such constraints. The other 
motivation for constrained optimisation comes from stability 
considerations. Typically, the underlying system is known a 
priori to be stable, especially if this system is a biological 
or physical system. A lot of stability conditions can be 
formulated as convex optimisation problems, e.g. Lyapunov 
stability conditions expressed as Linear Matrix Inequalities 
[ 8 ], Gershgorin Theorem for linear systems [9], etc. Only few 
contributions are available in the literature that address the 
problem of how to consider a priori information on system 
stability during system identification [10], [11]. To be able 
to integrate constraints on w into the problem formulation, 
we consider the following assumption on w. 

Assumption 1: Constraints on the weights w can be de¬ 
scribed by a set of convex functions: < 0, Vi; 

= 0, Vj, where the convex functions —>■ 

K are used to define inequality constraints, whereas the 
convex functions —)■ M are used to define equality 

constraints. 


IV. Methods 

To get an estimate of w in Q, we use Bayesian modelling 
to treat all unknowns as stochastic variables with certain 
probability distributions [13]. For y = Aw-|-^, it is assumed 
that the stochastic variables in the vector ^ are Gaussian 
distributed with unknown covariance matrix 11 , i.e., ^ ~ 

Af(o,n). 

In what follows we consider the following variable sub¬ 
stitution for the inverse of unknown covariance matrix or 
precision matrix: S = 11“^. In such case, the likelihood of 
the data given w is 


F(y|w) = A/'(y|Aw,n) oc exp 



y)^S(Aw 
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A. Sparsity Inducing Priors 

In Bayesian models, a prior distribution V{w) can 


defined as 7^(w) 

= where V{w 


TT^ 

= Uj=i exp 



, with being a given function of w^\ 

Generally, w in (|^ is sparse, and therefore certain spar¬ 
sity properties should be enforced on w. To this effect, 
the function g{-) is usually chosen to be a concave, non¬ 
decreasing function of iVi [ 6 ]. Examples of such functions 
g{-) include Generalised Gaussian priors and Student’s t 
priors (see [ 6 ], [14] for details). 

Computing the posterior mean E(w|y) is typically in¬ 
tractable because the posterior 7^(w|y) is highly coupled 
and non-Gaussian. To alleviate this problem, ideally one 
would like to approximate 7^(w|y) as a Gaussian dis¬ 
tribution for which efficient algorithms to compute the 
posterior exist [13]. For this, the introduction of lower 
bounding super-Gaussian priors i.e., = 

max-y;>o A/'(wp^| 0 , 7 i)(^( 7 i), can be used to obtain an an¬ 
alytical approximation of 7^(w|y) [14]. 

Note that problem (|^ has a block-wise structure, i.e. the 
solution w is expected to be block-wise sparse. Therefore, 
sparsity promoting priors should be specified for 'P{wi), Vi. 
To do this, for each block w^, we define a hyper-parameter 
7 i such that 


•PCw,) = maxA/'(w,|0,7iIc)</5(7z) 

7i>0 

C 

= max Y[ A/'(wp' |0,7i)v’(7i), 

i=i 
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where is a nonnegative function, which is treated as 

a hyperprior with 7^ being its associated hyperparameter. 
Throughout, we call (p{'yi) the “potential function”. This 
Gaussian relaxation is possible if and only if logV(y/wi) 
is concave on (0,oo). Defining 

...,7i] e Ti = diag [ 7 ,], 

7 = [ 71 , ■ • ■, 7iv] G r = diag [ 7 ], 


( 13 ) 


















































we have 


^(w) = TT 'P(wi) = maxA/’(w|0, r)93(7). (14) 

“y '>0 

i = l 


Using basic principles in convex analysis, we then obtain 
the following analytic form for the negative gradient of r!(7) 
at 7 (using the chain rule): 


B. Cost Function 

Once we introduce the Gaussian likelihood in and 
the variational prior in d, we can get the following 
optimisation problem jointly on w, 7 and S. 

Proposition 1 : The unknowns w,7, S can be obtained by 
solving the following optimisation problem 

£(w, 7, S) = min {— log |S| + log |r| + log |r“^ + A^SA| 

w.^,S 


ot’^ = — Vt,ii(7, 


=v^ [log |r-i + A^S'AI + log |r|] 

= diag{[-(r'=)-i + A^S''a]“'} . diag{-(r")-2} 

+ diag-i{r'=} (20) 
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where F is given in ( fO] ). 

Proof: The derivation can be found in the SI The 
proof mainly relies on using marginal likelihood maximisa¬ 
tion. ■ 

C. Algorithm 

The cost function in is convex in w and S but 
concave in F. This non-convex optimisation problem can be 
formulated as a convex-concave procedure (CCCP). It can be 
shown that solving this CCCP is equivalent to solving a series 
of iterative convex optimisation programs, which converges 
to a stationary point [ 15 ]. Let 

u(w, 7, S) = (y — Aw)^ S (y — Aw) -f w^F~^w — log det 
vh, S) = - log |F| -f log |F"i + A^SA| -f ) >(7j) 


Therefore, the iterative procedures ( [T 7 ] l and ( [T8] l for w^+^ 
and 7^+^, respectively, can be formulated as 




= argmin (y — Aw) (y — Aw) 

■ 7 ^ 0 ,W 
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The optimal 7 components are obtained as: 7^ = 
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. If 


7 is fixed, we have w''+^ by solving optimisation problem 
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min (y - Aw)^ (y - Aw) -f 2 V |j6»f • w^Ha, 

W 

i—1 

. ( 22 ) 
\Vhere 9 ^ = Ca^. We can then inject this into the expression 
of 7i, which yields 


7^^ = 




It is easy to check that ^(7, S) is a convex function with 
respect to 7. Furthermore, log | • | is concave in the space 
of positive semi-definite matrices. Since we adopt a super- 
Gaussian prior with potential function i^(7j), Vj, as described 
in ([T2|, a direct consequence is that p{'^j) = — logV 5 { 7 j) 
is concave, and, therefore, —pi'jj) is convex [ 5 ]R Note 
that u.(w, 7, S) is jointly convex in w, 7 and S, while 
u(7, S) is jointly convex in 7 and S. As a consequence, 
the minimisation of the objective function can be formulated 
as a concave-convex procedure 


^/^c 


( 23 ) 


«(w,7,S)-u(7,S). 

7 ^0,S^0.w 


After we get w^+^ and 7^“*"^, we can proceed with the 
optimisation iteration in ( [T 9 l ): 

=-Vsu( 7 ^ 8 ^=) 

= Vs (logdet (F-'= -f A^S'^A)) ( 24 ) 

= A(F-*^ -f A^S'^Aj-^A^. 

Letting = (Aw*^+^ — y) • (Aw*^+^ — y)^, we can get 

an estimate of the inverse of covariance matrix S as: 

= argmin Tr (SY'=+i) - log det S + Tr (A'^S) . 


( 16 ) 


s^o 


Since ^(7, S) is differentiable over 7, the problem in ( 16 1 
can be transformed into the following iterative convex opli- 
misation problem 


w®”^ = argmin m(w, 7 
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fc +1 
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= argmin m(w*^, 7, S*^) — , S*’)^7 


7^0 

= argmin^(w^, 7^, S) — Vst'(7^, S'^)' S. 
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^In this paper, the prior is chosen as a Student’s t prior thus p(7j) = 1. 


( 25 ) 

Given 7^+1 in ( | 2 ^ and in we can then go back 
to ( |20l ) to update oc for the next iteration. 

This above described iterative procedure for identification 
is summarised in Algorithm [T] below. Some further discus¬ 
sion can be found in SI Section IS6I 

D. ADMM Implementation 

Essentially, Algorithm ([^ consists of a reweighted Group 
Lasso algorithm ( [ 26 ] ) and a reweighted inverse covariance 
estimation algorithm Algorithm 0 can be imple¬ 

mented using the Alternating Direction Method of Multi¬ 
pliers (ADMM) [ 19 ]. This ADMM parallelisation allows to 












Algorithm 1 Nonlinear Identification Algorithm using Het¬ 
erogeneous Datasets 


Collect C heterogeneous groups of time series data from 
the system of interest (assuming the system can be 
described by ([T]i); 

Select the candidate basis functions that will be used to 


construct the dictionary matrix described in Section III 


3: Initialise 0° = 1, Vi, a? = S° = I, A° = I; 

4: for A: = 0,, fcmax do 

5: can be obtained by solving the following 

weighted minimisation problem over w, subject to the 
convex constraints in Assumption 

1 ^ 
min o (y - Aw)^ s'" (y - Aw) + ^ ||6lf ■ Wi|| 2 ; 

i = l 

( 26 ) 

6 : Update using ( |2^ ; 

7: Let y) • (Aw'^+i - y)^; 

8 : can be obtained by solving the following 

weighted minimisation problem over the inverse of 
the covariace matrix; 

mmTr(Y'=+i-bA'=)S-logdetS; ( 27 ) 

9: Update using ( |20l l; 

10 : Update 

11 : Update A'^^^ using ( |24| ; 

12 : if a stopping criterion is satisfied then 

13: Break; 

14: end if 

15: end for 


distribute the algorithmic complexity to different threads and 
to build a platform for scalable distributed optimisation. This 
is key to be able to deal with problems of large dimensions. 
More details can be found in SI Section ISTl 


E. Connection to SDP formulations and the sparse multiple 
kernel method 

The iteration in can be rewritten in the following 
compact form 

= argmin(y - Aw)^ (y - Aw) 

'y^0,w 

+ w^r~^w — S^)^7. 

(28) 

This is equivalent to the following SDP by using the standard 
procedure in [17]. 


min 

z.w.'y 
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y-Aw (S'=)-i 

w 0 
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The cost of solving this SDP is at least as well as 
M. Therefore, solving this SDP is too costly for all but 
problems with a small number of variables. This means that 


the number of samples, the dimension of the system, etc., 
can not be too large simultaneously. In this SDP formulation, 
r is closely related to the sparse multiple kernel presented 
in [18]. Certain choice of kernels may introduce some good 
properties or help reduce algorithmic complexity. In our case, 
we choose F to have a diagonal or a DC kernel structure. 

V. Simulations 

In this section, we use numerical simulations to show 
the effectiveness of the proposed algorithm. To compare the 
identification accuracy of the various algorithms considered, 
we use the root of normalised mean square error (RN- 
MSE) as a performance index, i.e. RNMSE = ||westimate — 
Wtrue|j 2 /||wtrue|| 2 - Several factors affect the RNMSE, e.g. 
number of experiments C, measurement noise intensity, 
dynamic noise intensity, length of single time series data 
M, number of candidate basis functions N. Eor brevity of 
exposition, we only show results pertaining to change of 
RNMSE over number of experiment C and length of single 
time series for one experiment, all in the noiseless case. More 
results related to other factors that may affect RNMSE will 
be shown in a future journal publication presenting these 
results in more details. 

As an illustrative example, we consider a model of an 
eight species generalised repressilator [ 20 ], which is a system 
where each of the species represses another species in a 
ring topology. The corresponding dynamic equations are as 
follows: 

= „Pi3^_LVpi3 +Pl4-Pl5Xlt, 

B. “ . (2!>) 

^it — Pi4 Pib'^itt '7*7 — 2, ... 8 , 

Pi2 + 

where pij, i = 1,..., 8 , j = 1,..., 5. We assume the 
mean value for these parameters across different species and 
experiments are pn = 40, pi 2 = 1, pa = 3, pn = 0.5, 
Pib = 1, VL We simulate the ODEs in ( |29l l to generate 
the time series data. In each “experiment” or simulation 
of ( |29| ), the initial conditions are randomly drawn from a 
standard uniform distribution on the open interval (0,1). The 
parameters in each experiment vary no more than 20 % of 
the mean values. In MATLAB, one can use (0 .8 + 
0 .4* rand { 1 ) ) to generate the corresponding parameters 
for each experiment. 

The numerical simulation procedure can be summarised 
as follows: 

1) The deterministic system of ODEs is solved 
numerically with an adaptive fourth-order Runge-Kutta 
method; 

2) As explained in Gaussian measurement noise with 
variance is added to the corre^onding time-series 
data obtained in the previous stepR 

3) The data is re-sampled with uniform interval^ 

4) The local polynomial regression framework in [21] is 
applied to estimate the first derivative; 

^In the example presented here, we consider the noiseless case corre¬ 
sponding to (T = 0. 

'^In this example, interval length is set to 1. 














(a) Group Lasso (first iteration of (b) Algorithm with maximal it- 
Algorithm [TJ. The minimal RN- eration number fcmax = 5. The 
MSE is around 0.75 minimal RNMSE is almost 0. 

Fig. 1. Algorithm comparison in terms of RNMSE averaged over 
50 ind epen dent experiments. An enlarged version can be found in SI 
Section IS8I 


5) A dictionary matrix is constructed as explained in 
Section [III) 

6) Algorithm [T] is used to identify the model. 

Following the procedure described in Section II, the can¬ 
didate dictionary matrix A in step 5) above is constructed 
by selecting as candidate nonlinear basis functions typi¬ 
cally used to represent terms appearing in ODE models 
of Gene Regulatory Networks. As a proof of concept, we 
only consider Hill functions as potential nonlinear candidate 
functions. The set of Hill functions with Hill coefficient 
h, both in activating and repressing form, for the state 
variables at time instant tk are: 

h\\\{x^t, K, hnum, hden) = , - — (30) 

+ xT" 


where hnum and hden represent the Hill coefficients. When 
hnum — 0, the Hill function has a repression form, whereas 
an activation form is obtained for /inum = ^den ^ 0. 

In our identification experiment, we assume hnum, hden 
and K to be known. We are interested in identifying the 
regulation type (linear or Hill type, repression or activation) 
and the corresponding parameters pn, the degradation rate 
constant and the basal expression rate pi^, Vi. Since there 
are 8 state variables, we can construct the dictionary matrix 
A with 8 (basis functions for linear terms) -|-(2 * 8) (basis 
functions for Hill functions, both repression and activation 
form) -fl (constant unit vector) = 25 columns. The cor¬ 
responding matrix A is given in Eq. ( |S8.1| l in Supporting 
Information Section ISSl 

Eor a fixed number of experiments C and length of 
single time series M, we compute the RNMSE over 50 
simulations by varying initial conditions and parameters pij. 
The RNMSE over C and M are shown in Eig. |2(a)| and 
Eig. 2(b) using both group Lasso and Algorithm [IJ with 
the maximal iteration number /cmax = 5 (see line 4 in 
Algorithm [T]). Inspection of the results presented in Eig. 2(a) 


and Eig. 2(b) clearly show that Algorithm [T] outperforms 
significantly group Lasso in terms of RNMSE. 


VI. Discussions 

There are several issues that we plan to further explore 
in the future. Eirst, we are working on establishing the 
minimal sampling rate necessary to yield adequate numerical 


estimates of the first derivative matrix (see Eq. 0). Second, 
further results not shown in this paper indicate that RNMSE 
is high when dynamic noise and measurement noise are high. 
We are currently working on finer characterisation of the 
“quality” of the identification in terms of the Signal-to-Noise 
ratio. 
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A. Notation 


Supporting Information 


Notation: The notation in this paper is standard. Bold symbols are used to denote vectors and matrices. For a matrix 
A G Aij G K denotes the element in the row and column, A^^: G denotes its row, A.j G 

denotes its column. For a column vector a. G ai denotes its element. In particular, 1^ denotes the identity 

matrix of size L x L. We simply use I when the dimension is obvious from context. ||w||i and ||w ||2 denote the £i and 
£2 norm of the vector w, respectively. ||w||o denotes the £0 norm of the vector w, which counts the number of nonzero 
elements in the vector w. diag [ 71 ,... , 7 Ar] denotes a diagonal matrix with principal diagonal elements being 71 ,... , 7 Ar. 
E(q:) stands for the expectation of the stochastic variable ct. (x means “proportional to”. blkdiag[A[^l,..., AI*^!] denotes a 
block diagonal matrix with principal diagonal blocks being A^^l,..., A^*^! in turn. Tr(A) denotes the trace of A. A matrix 
A ^ 0 means A is positive semidefinite. A vector 7^0 means each element in 7 is nonnegative. 


B. Discussion on Selection of Basis Fucntions 

Some a priori knowledge of the field for which the models are developed can be helpful. Indeed, depending on the field 
for which the dynamical model needs to be built, only a few typical nonlinearities specific to this field need to be considered. 
For example, the class of models that arise from biochemical reaction network typically involves nonlinearities that capture 
fundamental biochemical kinetic laws, e.g. first-order functions /([S']) = [S'], mass action functions /([Si], [S 2 ]) = [Si] • [S 2 ], 
Michaelis-Menten functions /([S]) = Fmax [S] /{K + [S]), or Hill functions /([S]) = I4iax [S]^ /{K^ + [S]^). 


C. Estimation of the First Derivative 


Estimating time derivatives from noisy data in continuous-time systems can either be achieved using a measurement 
equipment with a sufficiently high sampling rate, or by using state-of-the-art mathematical approaches [21]. Estimation of 
derivatives is key to the identification procedure [21]. As pointed out in [1], the identification problem is generally solved 
through discretisation of the proposed model. Assuming that samples are taken at sufficiently short time intervals, various 
discretisation methods can be applied. Typically, a forward Euler discretisation is used to approximate first derivatives, i.e. 
Yi can be defined as 


A 


Yr = 


At 




At 


Ti M X 1 


In this paper, the local polynomial regression framework in [21] is applied to estimate x(f). Eorward Euler discretisation 
and central difference discretisation are special cases of the local polynomial regression framework. 

Proposition 2 (Proposition 1 in [21]): Consider the bivariate data {ti,Yi),..., (tjvr,Y m)- Assume data are equispace- 
sampled and let A + — fc, the weights Wj are chosen as: 


Wi = 




k{k + l){kY2y 


j = l,...,k. 


Based on these weights, the first derivative can be approximated as: 


(S3.1) 


i=i 


Yj+j Yj-j 
£i+j ^i—j 


(S3.2) 


D. Derivation of Correlated Noise 

Without loss of generality and to ease notation, we consider the scalar case. In the scalar case, the system equation is 
written: 

xt= g{xt) Frixt, (S4.1) 

while the measurement equation is given as: 

yt=xt + et, (S4.2) 


where Ct is the measurement noise, which is assumed to be Gaussian i.i.d. We can simply use Taylor series expansion to 
expand g(xt) from yp. 

9{xt) = g(yt - et) 

=9ivt ) - 9 'iy)\v=yAt+0{e^it)\ 

Correlated Gaussian noise 


=9{yt)+Vyf 



Therefore, if we can estimate i from y properly, we can write the following 


-^ 

new noise 

= giyit)) + Ti{t)- 

Clearly, 77 (f) is not independent and identically distributed anymore. 


E. Derivation of the Cost Function 


To derive the cost function in Section IV-B we first introduce the posterior mean and variance 


niw — ) 

Sw = (ATSA + r^i)-i. 


Since the data likelihood 7^(y|w) is Gaussian, 

A/'(y|Aw, S"^) 


(27r)^/^|S|-i/2 

we can write the marginal likelihood as 


exp 


(y - Aw)^ S (y - Aw) 


N 


A/*(y|Aw, n)A/*(w|0, T) n 


3 = 1 


N 


'(27r)“/^ |S|-i/2 (27r) 


N 


exp{-E(w)}dw P ipijj), 
i=i 


where 


Equivalently, we get 


E{-w) = - (y — Aw)^ S (y — Aw) + -w^F ^w. 


^^(w) =-(w-rriw) (w-riiw)+E^(y), 
where niw and Sw are given by ( S5.1| ) and ( |S5.2 i. 

We first show the data-dependent term is convex in w and 7 . From (S5.1i, (S5.2l, the data-dependent term 
re-expressed as 

E{y) (y^Sy - yTSASwA^Sy) 

= l (y^Sy - yTSASwS-^SwA^Sy) 


= ^ (y - Ariiw)^ S (y - Aniv 


1 


mJ,r ^niv 


^ (y - Aw)^ S (y - Aw) + V 


Using ( |S5.6[ ), we can evaluate the integral in ( |S5.4[ ) to obtain 

y exp{-T;(w)}dw = exp{-£;(y)}(27r)^|Sw|^/^ 


Applying a —21og(-) transformation to (S5.4i, we have 

- 2 log ' 
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(27r)"/^|S|-i/2 (27r) 


N 


exp{-E(w)}dw P 

3 = 1 


=(M + N) log2^-log|S| +log|r| -f log|r-i -f A^SA| 


N 


+ + (y — Aw)^ S (y — Aw) 4 - w^F ^w. 

3=1 


(S4.4) 

(55.1) 

(55.2) 

(55.3) 

(55.4) 

(55.5) 

(55.6) 

can be 

(55.7) 

(55.8) 


(S5.9) 


























Therefore we get the following cost function to be minimised in ( [T5] l over w, 7 , S 

£{w,-y,S) 

= - log |s| + log |r| + log |r-i + A^SA| 

N 

+ (y — Aw)^ S (y — Aw) + w^r“^w + 

F. Some Discussion on Algorithm 

Remark 3: It should be noted that when noise is Gaussian i.i.d. with known variance, sparse Bayesian learning algorithms 
are provably better than classic Group Lasso algorithms in terms of mean square error [16], 

Remark 4: The initialisation step is important (line 3 of in Algorithm^. In special cases where the process noise in 0 
is Gaussian i.i.d and there is no measurement noise, S can be fixed to A ~I/or all k, where A is a positive real number, i.e. 
no update through is carried out. In such situations, A can be treated as the equivalent of the regularisation/trade-off 
parameter in a Group Lasso algorithm 0 - In such cases, cross validation can be implemented through variations of the 
initialisation values. 

Remark 5: When the model obtained is used for prediction purposes, the inverse covariance estimation procedure in 0 
can be used for quantification of the prediction uncertainty or risk. 


G. ADMM Implementation 

Essentially, Algorithm 0 consists of a reweighted Group Lasso algorithm ( |26) and a reweighted inverse covariance 
estimation algorithm 0 . Algorithm 0 can be implemented using the Alternating Direction Method of Multipliers (ADMM) 
[19]. This ADMM parallelisation allows to distribute the algorithmic complexity to different threads and to build a platform 
for scalable distributed optimisation. This is key to be able to deal with problems of large dimensions. ADMM can be used 
to obtain solutions to problems of the following form; 


min f{w)+g{z), 

W 

subject to Pw + Qz = c, 


(S7.1) 


where w S M" and z S M™, P S Q e and c S The functions /(•) and g{-) are convex, but can be 

nonsmooth, e.g. weighted norm. The first step of the method consists in forming the augmented Lagrangian 

-^-p=/(w)+5(z)+u^(-Pw + Qz-c)+ 
p/2||Pw PQz- c\\l. 


After that optimisation programmes with respect to different variables can be solved separately as follows; 


;= argmin [/(w) + ^||Pw + Qz^ - c + 

;= argmin ^^(z) + + Qz - c + u^|| 2 ^ 

;= + Pw’’“''^ + Qz^^^ — c. 

If g(z) is equal to A|| 2 ;||i, then the update on z is simply 

z^+i = S'a/p(Pw"'+^ + u"'), 
where ^A/p is the soft thresholding operator defined as 

{ X — \/p if X > X/p 
0 if \x\ < A/p 

X + A/p if a; < —A/p 
or 

S\/p{x) = max(0, x — A/p) — max(0, —x — A/p). 

Based on the above, we can design a simple algorithm to solve a nonsmooth optimisation problem in a decentralised fashion. 
Moreover, this algorithm converges provided the following stopping criterion is satisfied; 

||W^ - Z^||2 < Cpnmal, ||p(z^ “ Z^“^)||2 < ^duah 

where, the tolerances Cprimai > 0 and Cduai > 0 can be set via an “absolute plus relative” criterion, e.g. 


eprimal = V^^abs + e^-e^ max(|| || 2 , 
^dual — y/^^abs “t“ C7-e^p|ju ||, 



A = 


Xll 


XlM 


*81 hill(a;ii, 1,0,3) 
xsM hill(*iM, 1, 0, 3) 


hill(*8i.l,3,3) 1 

hill(a;8M, 1, 3, 3) 1 




(S8.1) 
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(a) Root of Normalised Mean Square Error averaged over 50 inde- (b) Root of Normalised Mean Squai'e Error averaged over 50 indepen- 
pendent experiments using group Lasso algorithm (first iteration of dent experiments using Algorithm with maximal iteration number 
Algorithm!^ fcmax = 5. 

Fig. S2. Algorithm comparison in terms of RNMSE. 


where eabs and erei are absolute and relative tolerances. More details can be found in [19]. 
More specifically, step (|27ll can be solved using ADMM instead: 


N 


(y - Aw)^ (y - Aw) + 2 ^ ||zi| 


2i 


subject to — Zi = 0, i = 1,..., V, 

The optimisation programmes with respect to different variables can be solved separately as follows: 

= (A^S'^A + pi)”' (A^SV + u'^)), 

z[+i=5vp(0fw[+i+u"), * = !,..., A, 




- — 


.r+l 


S is the vector soft thresholding operator : 


(S7.3) 


5,,(a) = (l-K/||a||)+a, (S7.4) 

where 5 k^( 0) = 0. This formula reduces to the scalar soft thresholding operator when a is a scalar. More details can be 
found in [19]. 


H. Simulation 

In this section, we put the constructed dictionary matrix and two enlarged figures. 
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